Improving speaker turn embedding by crossmodal transfer learning from face embedding
Learning speaker turn embeddings has shown considerable improvement in
situations where conventional speaker modeling approaches fail. However, this
improvement is relatively limited when compared to the gain observed in face
embedding learning, which has been proven very successful for face verification
and clustering tasks. Assuming that faces and voices of the same identity
share some latent properties (such as age, gender, or ethnicity), we propose three
transfer learning approaches to leverage the knowledge from the face domain
(learned from thousands of images and identities) for tasks in the speaker
domain. These approaches, namely target embedding transfer, relative distance
transfer, and clustering structure transfer, use the structure of the
source face embedding space at different granularities as regularization
terms for the target speaker turn embedding space. Our methods are evaluated on
two public broadcast corpora and yield promising advances over competitive
baselines in verification and audio clustering tasks, especially when dealing
with short speaker utterances. The analysis of the results also gives insight
into the characteristics of the embedding spaces and shows their potential
applications.
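The target embedding transfer idea described above can be sketched as a triplet loss on speaker turn embeddings plus a term that pulls each speaker embedding toward the face embedding of the same identity. A minimal sketch, assuming precomputed embedding vectors; the function names and the `lam` weight are illustrative, not the paper's exact formulation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on squared L2 distances."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def target_transfer_loss(spk_anchor, spk_pos, spk_neg, face_target, lam=0.5):
    """Triplet loss on speaker turn embeddings plus a regularizer pulling
    the anchor toward the face embedding of the same identity (a sketch
    of the 'target embedding transfer' idea)."""
    transfer = np.sum((spk_anchor - face_target) ** 2)
    return triplet_loss(spk_anchor, spk_pos, spk_neg) + lam * transfer
```

In this sketch the face embedding acts as a fixed target learned from the (much larger) face corpus, so the speaker embedding space inherits its structure.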
A Sequential Topic Model for Mining Recurrent Activities from Long Term Video Logs
This paper introduces a novel probabilistic activity modeling approach that mines recurrent sequential patterns, called motifs, from documents given as word-time count matrices (e.g., videos). In this model, documents are represented as a mixture of sequential activity patterns (our motifs), where the mixing weights are defined by the motif starting-time occurrences. The novelties are threefold. First, unlike previous approaches where topics modeled only the co-occurrence of words at a given time instant, our motifs model the co-occurrence and the temporal order in which the words occur within a temporal window. Second, unlike traditional dynamic Bayesian networks (DBNs), our model accounts for the important case where activities occur concurrently in the video (but not necessarily in synchrony), i.e., the occurrences of activity motifs can overlap. The learning of the motifs in these difficult situations is made possible by the introduction of latent variables representing the activity starting times, enabling us to implicitly align the occurrences of the same pattern during the joint inference of the motifs and their starting times. As a third novelty, we propose a general method that favors the recovery of sparse distributions, a highly desirable property in many topic model applications, by adding simple regularization constraints on the searched distributions to the data likelihood optimization criterion. We substantiate our claims with experiments on synthetic data, to demonstrate the behavior of the algorithm, and on four video datasets with significant variations in their activity content, obtained from static cameras. We observe that, using low-level motion features from videos, our algorithm is able to capture sequential patterns that implicitly represent typical trajectories of scene objects.
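The generative view described above — a word-time count matrix composed of motifs placed at latent, possibly overlapping starting times — can be illustrated with a small sketch. The array layout and the weighting by occurrence counts are assumptions for illustration, not the paper's exact probabilistic model:

```python
import numpy as np

def generate_document(motifs, starts, n_words, n_times):
    """Compose a word x time count matrix as a mixture of temporal motifs.
    Each motif is a word x relative-time pattern; `starts` gives, per motif,
    a list of (starting_time, weight) occurrences. Occurrences may overlap,
    mirroring concurrent activities in the video."""
    doc = np.zeros((n_words, n_times))
    for motif, motif_starts in zip(motifs, starts):
        dur = motif.shape[1]
        for ts, weight in motif_starts:
            end = min(n_times, ts + dur)  # clip motifs at the document end
            doc[:, ts:end] += weight * motif[:, : end - ts]
    return doc
```

Inference in the paper goes the other way: given `doc`, it jointly recovers the motifs and their latent starting times.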
A Differential Approach for Gaze Estimation
Non-invasive gaze estimation methods usually regress gaze directions directly
from a single face or eye image. However, due to important variabilities in eye
shapes and inner eye structures amongst individuals, universal models obtain
limited accuracies, and their outputs usually exhibit high variance as well as
biases which are subject dependent. Therefore, increasing accuracy is usually
done through calibration, allowing gaze predictions for a subject to be mapped
to his/her actual gaze. In this paper, we introduce a novel image differential
method for gaze estimation. We propose to directly train a differential
convolutional neural network to predict the gaze differences between two eye
input images of the same subject. Then, given a set of subject specific
calibration images, we can use the inferred differences to predict the gaze
direction of a novel eye sample. The assumption is that, by comparing
two eye images, nuisance factors (misalignment, eyelid
closure, illumination perturbations) which usually plague single-image
prediction methods can be greatly reduced, allowing better prediction overall.
Experiments on three public datasets validate our approach, which consistently
outperforms state-of-the-art methods even when using only one calibration
sample or when the latter are followed by subject-specific gaze
adaptation.
Comment: Extension of our paper "A differential approach for gaze estimation with calibration" (BMVC 2018). Submitted to PAMI on Aug. 7, 2018; accepted in Dec. 2019 as a short paper in IEEE Transactions on Pattern Analysis and Machine Intelligence.
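The calibration-time inference described in the abstract — predicting the gaze of a novel eye image from its differences to a set of calibration samples — can be sketched as follows. Here `diff_net` stands in for the trained differential CNN, and the simple averaging of per-calibration votes is an assumption for illustration:

```python
import numpy as np

def predict_gaze(diff_net, calib_imgs, calib_gazes, query_img):
    """Differential inference: the network predicts the gaze *difference*
    between the query eye image and each calibration image of the same
    subject; each calibration sample votes gaze_i + diff_i, and the
    votes are averaged into the final prediction."""
    preds = [g + diff_net(c, query_img)
             for c, g in zip(calib_imgs, calib_gazes)]
    return np.mean(preds, axis=0)
```

With more calibration samples, the averaging reduces the variance of individual difference predictions.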
What to Show? - Automatic Stream Selection among Multiple Sensors
The installation of surveillance networks has been growing exponentially in the last decade. In practice, videos from large surveillance networks are almost never watched, and it is frequent to see surveillance video wall monitors showing empty scenes. There is thus a need to design methods that continuously select the streams to be shown to human operators. This paper addresses this issue and makes three main contributions: it introduces and investigates, for the first time in the literature, the live stream selection task; based on the theory of social attention, it formalizes a way of obtaining ground truth for the task and hence of evaluating stream selection algorithms; and finally, it proposes a two-step approach to solve this task and compares different approaches to interestingness rating within our framework. Experiments conducted on 9 cameras from a metro station and 5 hours of data randomly selected over one week show that, while complex unsupervised activity modeling algorithms achieve good performance, simpler approaches based on the amount of motion perform almost as well for this type of indoor setting.
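The simple motion-based interestingness rating mentioned above can be sketched as follows; the frame-differencing score and the top-1 selection rule are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def motion_score(prev_frame, frame):
    """Amount of motion: mean absolute inter-frame difference."""
    return np.mean(np.abs(frame.astype(float) - prev_frame.astype(float)))

def select_stream(prev_frames, frames):
    """Second step of a two-step selection sketch: rate each camera by its
    current amount of motion and show the highest-scoring stream."""
    scores = [motion_score(p, f) for p, f in zip(prev_frames, frames)]
    return int(np.argmax(scores)), scores
```

A real system would smooth scores over time and add hysteresis to avoid rapid switching between cameras.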
ChildPlay: A New Benchmark for Understanding Children's Gaze Behaviour
Gaze behaviors such as eye-contact or shared attention are important markers
for diagnosing developmental disorders in children. While previous studies have
looked at some of these elements, the analysis is usually performed on private
datasets and is restricted to lab settings. Furthermore, publicly available
gaze target prediction benchmarks mostly contain instances of adults, which
makes models trained on them less applicable to scenarios with young children.
In this paper, we propose the first study for predicting the gaze target of
children and interacting adults. To this end, we introduce the ChildPlay
dataset: a curated collection of short video clips featuring children playing
and interacting with adults in uncontrolled environments (e.g. kindergarten,
therapy centers, preschools etc.), which we annotate with rich gaze
information. We further propose a new model for gaze target prediction that is
geometrically grounded by explicitly identifying the scene parts in the 3D
field of view (3DFoV) of the person, leveraging recent geometry preserving
depth inference methods. Our model achieves state-of-the-art results on
benchmark datasets and ChildPlay. Furthermore, results show that the
performance of looking-at-faces prediction is much worse on children than
on adults, and can be significantly improved by fine-tuning models using
child gaze annotations.
Our dataset and models will be made publicly available.
Comment: First submitted to CVPR 2022. The current draft is in review.
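The geometric grounding described above — identifying which scene parts fall inside a person's 3D field of view — can be sketched as a cone test around the gaze direction; the half-angle threshold is an assumed parameter for illustration, not the paper's exact 3DFoV construction:

```python
import numpy as np

def in_field_of_view(eye_pos, gaze_dir, point, half_angle_deg=60.0):
    """A scene point lies in the person's 3D field of view if the angle
    between the gaze direction and the eye-to-point ray is below a
    half-angle threshold (cosine comparison avoids an explicit arccos)."""
    ray = point - eye_pos
    ray = ray / np.linalg.norm(ray)
    g = gaze_dir / np.linalg.norm(gaze_dir)
    return float(np.dot(g, ray)) >= np.cos(np.radians(half_angle_deg))
```

Combined with depth inference, such a test restricts gaze target candidates to the visible 3D region in front of the person.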
Comparison of Support Vector Machine and Neural Network for Text Texture Verification
In this paper, we propose a method for classifying regions of images and video frames into text and non-text regions using a support vector machine (SVM). Different features are proposed to characterise the texture formed by text characters and the background. SVMs have the advantage of being insensitive to the relative numbers of training examples in the positive and negative classes. This advantage is illustrated by comparing results with those obtained using a multilayer perceptron (MLP).
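For illustration, a linear SVM can be trained with sub-gradient descent on the hinge loss over toy two-class feature vectors; this is a generic sketch of the classifier family, not the paper's texture features or training procedure:

```python
import numpy as np

def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200):
    """Minimal linear SVM: stochastic sub-gradient descent on the
    L2-regularized hinge loss. Labels y must be in {-1, +1}."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if y[i] * (X[i] @ w + b) < 1:   # margin violated: hinge active
                w = w - lr * (lam * w - y[i] * X[i])
                b = b + lr * y[i]
            else:                            # only the regularizer acts
                w = w - lr * lam * w
    return w, b

def predict(w, b, X):
    """Classify samples by the sign of the decision function."""
    return np.sign(X @ w + b)
```

In practice one would use a mature SVM implementation with kernel support rather than this hand-rolled variant.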